library(ptairMS)
library(ptairData)

1 Introduction

The ptairMS package provides a workflow to process PTR-TOF-MS raw data in the open Hierarchical Data Format 5 (HDF5; .h5 extension), and generate the peak table as an ExpressionSet object for subsequent data analysis with the many methods and packages available in R.

We present in this vignette a quick handling of this package, and application to two dataset : exhaled air, mycobacteria culture air. For more details on the parameters choice and algorithm, see ptairMS vignette.

2 ptairData package

Two real PTR-TOF-MS (Proton Transfert Reaction Time Of Flight) raw data sets are available on the ptairData package, from

  • exhaled air of two healthy individal, three acquisitions per individual on diffrents day with two exprirations

  • mycobateria culture air, two replicates of each two diffrent species and one control (culture medium)

For memory reasons, the mass axis is truncate at the range: [20.4,21.6] U [50 , 150] for human data set, and [20.4 ; 21.6] U [56.4 ; 90.6] for mycobacteria.

3 Workflow

To process several files, a ptrSet object is first generated with the function createPtrSet from the directory containing the raw files, possibly grouped into subfolders according to classes of samples. The object will contain all information about the study: the sample metadata, one peak list for each samples, and several quality metrics about the files. Second, the peak lists are aligned between samples and an ExpressionSet (Biobase package object) object is returned , containing the peak table, sample metadata, and feature metadata.

4 Exhaled air analysis

4.1 Create a ptrSet object

We start from a directory :

dirRaw <- system.file("extdata/exhaledAir", package = "ptairData")
exhaledSet <- createPtrSet(dirRaw,
                     setName="exhaledSet",
                     mzCalibRef = c(21.022, 60.0525,75.04406),
                     fracMaxTIC = 0.9,
                     saveDir = NULL )
## ind1-1.h5  check
## ind1-2.h5  check
## ind1-3.h5  check
## ind2-1.h5  check
## ind2-2.h5  check
## ind2-3.h5  check

This function performs for each files :

  • calibration with mzCalibRef parameters

  • expiration detection, that correspond to the part of Total Ion Chromatogram where the intensity is higher than fracMaxTIC * max(TIC)

  • quantify th eprimary ion on the isotope H3(O18)+

  • provides a default sampleMetadata based on the tree structure of the directory

A summary plot to check the calibration error and reference calibration peak average shape:

plot(exhaledSet)

Interactive plot to see Total Ion Spectrum and expirations limits:

plotTIC(exhaledSet,showLimits = T,baselineRm = T)

4.2 detecting peak

After creating the ptrSet, we porcess the files with detectPeak function.

For each file in the data set, this function :

  • calibrates each expiration and ambiant air

  • detects peak in each expiration and ambiant air

  • aligns features between the background and diffrents expirations

  • keeps features present in at least fracGroup % of expirations

  • if normalize=TRUE, normalizes with the primary ion (devided all features by the quantity of H3(O18)+ *488), and converts in ppb if the ptr reaction information are available in the h5 file (voltage, temperature and pressure of the drift)

exhaledSet <- detectPeak(exhaledSet)
## ind1-1.h5 : 125 peaks detected 
## ind1-2.h5 : 133 peaks detected 
## ind1-3.h5 : 117 peaks detected 
## ind2-1.h5 : 133 peaks detected 
## ind2-2.h5 : 142 peaks detected 
## ind2-3.h5 : 132 peaks detected

This function can take few times if there are many files and a large mz range. You can then parallelize the algorithm by set parallel = TRUE and give the number of cores in nbCores (the number of files that will be processed simultaneous). To see the number of CPU cores available in your computer, use parallel::detectCores().

4.3 align Samples

Once the files have been all processed, we then align samples (i.e. matching of variables between the peak lists of the samples) with alignSamples function. It generate an expressionSet object, that contains the resulting peak table, the sample meta data (from ptrSet) and variable meta data containing the background peak table.

It is possible to apply two filters:

  • On the background: only variables with intensities greater than bgThreshold * background are kept

  • On samples reproducibility: only variables which are detected in more than \(fracGroup\) % of samples are kept.

If you do not want to apply those filters, set fracGroup and bgThreshold to 0.

eSet <- alignSamples(exhaledSet, group="subfolder", fracGroup = 1, bgThreshold = 4 )
## 33 peaks aligned

4.4 imputation

To imputue missing values, ptairMS returns back to the raw data, and compute with the same method as detectPeak the missing values without limit of detection.

eSet <- ptairMS::impute(eSet,  exhaledSet)
## ind1-1.h5 done
## ind1-2.h5 done
## ind1-3.h5 done

4.5 annotation

ptairMS propose a putative annotation based on the database saved in 'inst/extdata/reference_tables/vocDB_tables. The function write in the features metadata all possible formulas and names found for each features

eSet<-annotateVOC(eSet)

4.6 export the expressionSet

You can export the expressionSet with the function writeEset into 3 tabulated files ‘dataMatrix.tsv’, sampleMetadata.tsv’, and ‘variableMetadata.tsv’:

writeEset(eset, dirC = file.path(getwd(), "processed_dataset"))

4.7 Statistical analysis

You can finally use known package as ropls to performs PCA or OPLS analysis.

X<-Biobase::exprs(eSet)
SMD<-Biobase::pData(eSet)
features<-Biobase::fData(eSet)
ropls::view(log2(X),printL = FALSE)

pca<-ropls::opls(t(log2(X)),crossvalI=5,info.txtC="none",fig.pdfC="none")
plot(pca,type="x-score",parAsColFcVn=SMD$subfolder)

dim1<-ropls::getLoadingMN(pca)[,1]
barplot(sort(abs(dim1),decreasing = T))

knitr::kable(head(features[names(sort(abs(dim1),decreasing = T)),c("vocDB_formula_Hplus","vocDB_cas.name")]))
vocDB_formula_Hplus vocDB_cas.name
67.0538 [C5H6+H]+ 1,3-cyclopentadiene
70.0732
69.0699 [C5H8+H]+, [H8C5+H]+ (E)-, 1,3-pentadiene, 3-methyl-1-butyne, cyclopentene, isoprene, pentyne ui
109.0666 [C7H8O+H]+ 1-hydroxy-3-methylbenzene, 4-methylphenol , benzyl alcohol, p-cresol
68.0598
138.1339

plotFeatures plot for all files the average spectrum baseline corrected of all expirations, and background superimposed as a dashed line.

plotFeatures(exhaledSet,mz=51.0442,colorBy="subfolder",typePlot = "ggplot")

plotFeatures(exhaledSet,mz=67.0539,colorBy="subfolder",typePlot = "ggplot")

plotFeatures(exhaledSet,mz=137.131,colorBy="subfolder",typePlot = "ggplot")

5 Mycobacteria: head space analysis

dirRaw <- system.file("extdata/mycobacteria", package = "ptairData")
bacteriaSet <- createPtrSet(dirRaw,
                     setName="bacteriaSet",
                     mzCalibRef = c(21.022, 60.0525),
                     fracMaxTIC = 0.7,
                     saveDir = NULL )
## Control1.h5  check
## Control2.h5  check
## Specie-a1.h5  check
## Specie-a2.h5  check
## specie-b1.h5  check
## specie-b2.h5  check
plot(bacteriaSet)

plotTIC(bacteriaSet,showLimits = T,baselineRm = T)
bacteriaSet <- detectPeak(bacteriaSet)
## Control1.h5 : 37 peaks detected 
## Control2.h5 : 37 peaks detected 
## Specie-a1.h5 : 41 peaks detected 
## Specie-a2.h5 : 37 peaks detected 
## specie-b1.h5 : 50 peaks detected 
## specie-b2.h5 : 40 peaks detected
eSet <- alignSamples(bacteriaSet, group="subfolder", fracGroup = 1, bgThreshold = 4 )
## 19 peaks aligned
eSet <- ptairMS::impute(eSet,  bacteriaSet)
## specie-b1.h5 done
eSet<-annotateVOC(eSet)

You can know use known package as ropls to performs PCA or OPLS analysis.

X <- Biobase::exprs(eSet)
SMD <- Biobase::pData(eSet)
features <- Biobase::fData(eSet)
ropls::view(log2(X),printL = FALSE)

pca<-ropls::opls(t(log2(X)),crossvalI=5,predI=2,info.txtC="none",fig.pdfC="none")
plot(pca,type="x-score",parAsColFcVn=SMD$subfolder,parEllipsesL=F)

dim1<-ropls::getLoadingMN(pca)[,1]
barplot(sort(abs(dim1),decreasing = T))

knitr::kable(head(features[names(sort(abs(dim1),decreasing = T)),c("vocDB_formula_Hplus","vocDB_cas.name")]))
vocDB_formula_Hplus vocDB_cas.name
57.07 [C4H8+H]+, [H8C4+H]+ (E)-, (Z)-, 2-butene, 2-methyl-1-propene, but-1-ene, butene ui, cyclobutane
58.0722
87.0795 [C5H10O+H]+, [H10C5O+H]+ 2-methylbutanal, 2-pentanone, 3-methyl-2-buten-1-ol, 3-methyl-3-buten-1-ol, 3-methylbutanal, 3-pentanone, E-2-penten-1-ol, methylbutanal, pentanal
74.0661
73.0649 [C4H8O+H]+ 2-butanone, 2-buten-1-ol, 2-methylpropanal, butanal, tetrahydrofuran
85.1027 [C6H12+H]+ 1-hexene, 2-methyl-1-pentene, cyclohexane, methylcyclopentane
plotFeatures(bacteriaSet,mz=63.0299,colorBy="subfolder",typePlot = "ggplot")

plotFeatures(bacteriaSet,mz=77.0607,colorBy="subfolder",typePlot = "ggplot")

plotFeatures(bacteriaSet,mz=87.0795,colorBy="subfolder",typePlot = "ggplot")